Abstract

Many template-based modeling (TBM) methods have been developed in recent years for protein structure
prediction and for the study of structure-function relationships. One major problem that all TBM algorithms face,
however, is unsatisfactory performance on low-homology proteins. To improve the
performance of TBM methods for such targets, we developed a novel model evaluation method named MEFTop.
The method evaluates model topology using two new groups of features:
secondary structure element (SSE) contact information and 3-dimensional topology information. By combining
MEFTop with FR-t5, a threading program developed by our group, we obtained a modified TBM program,
named FR-t5-M, that exhibited significant improvements in predictive ability for low-homology protein targets. We further
showed that MEFTop can serve as a general method for improving threading programs on low-homology protein targets.
The software (FR-t5-M and MEFTop) is available to non-commercial users at our website:
http://jianglab.ibp.ac.cn/lims/FRt5M/FRt5M.html.
Introduction

Template-based modeling is defined as the modeling of protein
structures based on already determined structure templates, and it
is currently the most powerful prediction method. To build a
structure model for a target sequence, a TBM method usually
follows four steps: identification of structural templates, alignment
of the target sequence to structural templates (or sequence-structure
alignment), model building, and model quality evaluation.
In recent years, various TBM programs were developed for
the first two steps [1,2,3,4,5,6,7,8,9,10]. In addition, powerful
model building tools were developed, including MODELLER and
SWISS-MODEL [11,12]. Lastly, a wide range of tools was
developed for the last step, model quality evaluation
[13,14,15,16,17,18,19,20,21,22,23,24].

Whilst TBM methods are now widely used for protein structure
prediction and structure-function relationship studies, their low
performance for low-homology proteins still presents a bottleneck.
The underlying reasons behind the bottleneck can be complicated,
and include incorrect template selection, errors in sequence-template
alignment, modeling errors, and a biased scoring function,
to name a few. Altogether, these errors ultimately result in the
failure to generate high-quality models, even when
good templates are present in the template library in use.

Our previously developed TBM method FR-t5 [7], which has
comparable performance to the state-of-the-art fold recognition
methods, faces the same problem. For targets in the
dawn region (defined as proteins that have an optimal Z-score <
6.0), the first-ranked models in FR-t5 frequently have a native-unlike
topology for the target sequence, even though native-like models
exist in the search space. These proteins in the dawn region are
low-confidence targets for FR-t5, and they include a significant
portion of low-homology proteins. Because features derived directly
from a structure model are more conserved, a model
evaluation method could provide an avenue to improve the
performance of TBM in the dawn region. Here we report a novel
model evaluation method called MEFTop that combines traditional
features with two groups of newly introduced structural
features. The testing results indicate that these novel
structural features contribute significantly to the improvement of
MEFTop performance in the dawn region. We further show that
MEFTop can be combined with FR-t5 and other threading
programs to improve low-homology protein modeling.
Results

In this section, we first show the performance improvement
of MEFTop for protein targets in the dawn region. We then
analyze the contribution of the newly introduced structural features in
MEFTop. Thirdly, we explore how FR-t5-M, the
combination of MEFTop with FR-t5, improves modeling for
targets in the dawn region, and test its performance on CASP10
targets. Finally, we demonstrate the application of MEFTop to
two other threading algorithms, RaptorX [10] and
SPARKS-X [9].

The Performance of MEFTop in the Dawn Region 

To evaluate the robustness of MEFTop, a 5-fold cross-validation
was carried out on the training set SCOP1.75-Z6.
The average and standard deviation of the percentage of native-like
Top1 models (Top1%) (see Methods section for details) was
46.84% ± 2.55%, which indicates stable performance of MEFTop
for targets in the dawn region. The performance of MEFTop was
further tested on the SCOP1.75-500 data set, which includes 110
proteins in the dawn region. The Top1% obtained with the P-score of
MEFTop was compared to that obtained with the Z-score of FR-t5.
As shown in Figure 1A, the Top1% selected
according to the P-score was higher for targets with an optimal
Z-score below the cutoffs of 4.0 or 5.0, but somewhat lower for targets
with an optimal Z-score less than 6.0. Furthermore, to
evaluate the models selected according to P-score and Z-score, we
compared the TM-score [25] of the Top1 models chosen by the two
metrics for targets with an optimal Z-score < 5.0 on the
SCOP1.75-500 set (Figure 1B). Of the 63 such targets in the testing set,
the P-score selected the better-quality Top1 model for 33 targets,
whereas the Z-score selected the better-quality Top1 model for 22.
These results indicate that, for targets in the
dawn region, better model selection can be achieved with the P-score
of MEFTop than with
the Z-score of FR-t5.



The Contribution of Newly Introduced Structural 
Features in MEFTop 

To investigate the contribution of the newly introduced structural
features of MEFTop to the improvement of model evaluation for
dawn region proteins, different combinations of features were
trained on the SCOP1.75-Z6 set and tested on the SCOP1.75-500
set. As shown in Table 1, the two groups of structural features,
SSE contact features and 3-dimensional topology
features, contributed significantly to the improvement of the
MEFTop method. When SSE and 3D topology features were
added separately, for the targets with an optimal Z-score less than
6.0, the Top1% increased to 56.4% (SSE) and 58.2% (3D
topology), compared to 53.6% when only the traditional features
were considered. Similar improvements were also observed for the
targets with optimal Z-score less than 4.0 and 5.0. As expected,
after combining both groups of structural features with the
traditional features, the Top1% increased further, from
53.6% to 62.7% for the targets with an optimal Z-score less than
6.0.

The Combination of MEFTop with FR-t5, Denoted as FR-t5-M,
Significantly Improves the FR-t5 in the Dawn
Region

As shown in Figure 1B, although overall the P-score of
MEFTop outperforms the Z-score of FR-t5 in model selection for
the targets in the dawn region, the two metrics are apparently
complementary. We therefore integrated the two metrics into a
single score (denoted as M-score) to achieve better prediction
performance by combining MEFTop and FR-t5
(denoted as FR-t5-M) (see Methods section for a detailed description).

To evaluate FR-t5-M, we compared the performance of the M-score
and the Z-score for the 110 targets in the dawn region of
SCOP1.75-500 (Table 2). From the data presented in Table 2, it
is evident that the M-score outperformed the Z-score for all
Figure 1. Comparison of the performance of MEFTop and FR-t5 in the dawn region of the SCOP1.75-500 set. (A) The percentage of native-like
Top1 models (Top1%) selected by MEFTop using P-score and by FR-t5 using Z-score. The X-axis is the Z-score cutoff and the Y-axis is the Top1%.
The performances of Z-score and P-score are shown as white and black columns, respectively. (B) The TM-score of Top1 models selected according to
Z-score and P-score for 63 targets with optimal Z-score < 5.0. The X-axis and Y-axis of each point represent the TM-scores of Top1 models selected by
Z-score and P-score, respectively.
doi:10.1371/journal.pone.0089935.g001
Table 1. Testing results for the contribution of structural
features to MEFTop in the dawn region on the SCOP1.75-500 set.

                          SCOP1.75-500 (Top1%)
Feature                   <6.0*    <5.0*    <4.0*
T                         53.6     36.5     23.9
SSE                       38.2     27.0     15.2
Topology                  43.6     28.6     23.9
T+SSE                     56.4     41.2     28.3
T+Topology                58.2     44.4     30.4
SSE+Topology              44.5     30.2     23.9
All (T+SSE+Topology)      62.7     42.9     30.4

* Targets with optimal Z-score less than this cutoff value (6.0, 5.0 or 4.0). On
the SCOP1.75-500 set, the numbers of targets are 110 (Z-score < 6.0), 63 (Z-score < 5.0)
and 46 (Z-score < 4.0).
doi:10.1371/journal.pone.0089935.t001

criteria listed. For instance, the average rank of Top1 models (see
Methods section for details) was 9.14 for the M-score, whereas it
was 11.48 for the Z-score. Figure 2 gives a more detailed
comparison of the two methods by looking at the quality of Top1
models according to TM-score. Notably, FR-t5-M found
high-quality models for 7 low-homology proteins (marked by
triangles) for which FR-t5 did not. Four of these 7 low-homology
proteins are illustrated in Figure 3. One example is the bacterial
immunity domain d2bl8c1, containing 81 amino acids (AA). The
Top1 model selected by FR-t5-M (M-score) has a TM-score of
0.728, which is of higher quality than the model selected by
FR-t5 (Z-score) (TM-score = 0.300). The other three examples are
d1b33n_ (67 AA), d2rdeb1 (110 AA) and d1sgka1 (155 AA). Their
Top1 models selected by FR-t5-M (M-score) were all native-like,
whereas the models selected by FR-t5 (Z-score) were native-unlike.
These differences in model selection between M-score and
Z-score show that structural features clearly contributed to
model evaluation and selection. As shown in Figure 3, all Top1
models selected according to their Z-score had SSE types
similar to those of the native structures, but the topological relationship
between these SSEs was not correct. The MEFTop
algorithm corrected this error by using the SSE
contact map and introducing topological constraints.

Since a significant portion of low-homology proteins fall
in the dawn region, we further compared FR-t5-M and
FR-t5 on these low-homology proteins. Of the 110 proteins in the dawn
region of SCOP1.75-500, 59 have sequence identity less than
40%. As shown in Table 3, for these 59 targets, the average rank of
Top1 models and the Top1% were 13.49 and 52.5% for M-score, and
17.72 and 42.4% for Z-score, respectively. A similar improvement
was also observed for the 25 proteins whose sequence identity is less
than 30%.

FR-t5-M was also evaluated on the 390 high-confidence targets
from the SCOP1.75-500 dataset (Table S1). We found
that the two methods exhibited similar performance for high-confidence
targets.

We further tested the performance of FR-t5-M on targets of the
recent CASP10. A comprehensive comparison between the
performance of FR-t5-M (M-score) and FR-t5 (Z-score) on the
103 targets of the CASP10 data set is shown in Table 4. Overall, FR-t5-M
outperformed FR-t5 as measured by average rank (9.00 vs
10.46) and average TM-score (0.570 vs 0.564). Notably, the
improvement came from the dawn region targets. For the 57
Figure 2. The TM-score of Top1 models selected according to
Z-score and M-score for all targets with optimal Z-score < 6.0
on the SCOP1.75-500 set. The X-axis and Y-axis of each point represent
the TM-score of Top1 models selected according to Z-score and M-score,
respectively. Low-homology proteins (marked by triangles) had
high-quality Top1 models with FR-t5-M (M-score) but not with FR-t5 (Z-score).

doi:10.1371/journal.pone.0089935.g002

targets in the dawn region, the average ranks for FR-t5-M and FR-t5
were 12.08 and 14.15, respectively.

The Integration of MEFTop with other Threading 
Methods 

Here we demonstrate that MEFTop can
offer a general approach to improving protein modeling by
combining it with two other popular threading programs,
RaptorX and SPARKS-X. The two integrated methods,
RaptorX-M and SPARKS-X-M, were tested on the 110 targets
in the dawn region of SCOP1.75-500. As shown in Tables 5 and 6,
both integrated methods improved on their base programs. For
RaptorX-M, the Top1% increased from 76.0% to 78.8%.

We further looked into the performance of the newly integrated
methods (RaptorX-M and SPARKS-X-M) on 59 low-homology
targets (sequence identity less than 40%) in SCOP1.75-500. From
the data presented in Table 7, for RaptorX-M, the Top1%
increased from 63.2% to 66.7%.

Discussion 

In order to improve low-homology protein modeling, we have
developed a model evaluation method (MEFTop) that focuses
on evaluating the native-likeness of topology. Further, by incorporating
MEFTop into our previously developed threading method
FR-t5, a new TBM method (FR-t5-M) was developed. We found
that FR-t5-M significantly outperforms our previous threading
method FR-t5, and displays a predictive performance for low-homology
CASP10 targets that is comparable to most other popular
protein structure prediction programs. Moreover, we observed
significant improvements in predicting structures for low-homology
proteins when combining MEFTop with RaptorX and
Figure 3. Four representative targets with different Top1 models selected by FR-t5-M (M-score) and FR-t5 (Z-score). The native
structure (red) of d1b33n_ (A), d2bl8c1 (B), d2rdeb1 (C) and d1sgka1 (D), the Top1 model selected by FR-t5-M using M-score (green) and FR-t5 using
Z-score (cyan) are shown. The TM-scores of Top1 models and native structures are presented. 3D structure models are produced with PyMOL
(http://www.pymol.org/).

doi:10.1371/journal.pone.0089935.g003



SPARKS-X. Taken together, we argue that MEFTop could offer a 
generalized method to improve threading algorithms for low- 
homology protein modeling. 

A wide range of earlier studies have demonstrated that
traditional features based on 1D and 2D information can be effectively
used for high-quality model evaluation [15,19,21]. Our
research revealed that the integration of SSE contact features
and 3D topology features into the model evaluation method
MEFTop greatly increased the quality of model evaluation for
proteins in the so-called dawn region. The incorporation of these
two groups of structural features was intended to capture
topological information during evaluation of the quality of
models. As shown above, the introduction of these structural
features significantly improves the percentage of native-like Top1
models in the dawn region and for low-homology proteins.

Whilst we have shown that the application of MEFTop or FR-t5-M
brings significant improvements, both methods can be
optimized further. First, models of FR-t5-M could be improved
by introducing side-chain packing and refinement in the
future. Second, a systematic and complete optimization of the program
code should speed up the program. As an example of practical
use, a mutation in the transporter membrane protein
SLC45A2, which is the genetic basis of the fur color of white
tigers, was successfully predicted by using FR-t5-M [26]. In
summary, both our model evaluation method MEFTop and the
improved TBM program FR-t5-M could facilitate a wide range of
applications.
Table 2. Improvements of FR-t5-M over FR-t5 in the dawn
region on the SCOP1.75-500 set.

                                                     Top1%
Metrics    Average (a)   Sum (b)   CC ± σ (c)        <6.0 (d)   <5.0 (d)   <4.0 (d)
Z-score    11.48         66.41     0.472 ± 0.382     64.5       41.3       26.1
M-score    9.14          69.08     0.556 ± 0.344     69.1       50.8       37.0

(a) The average rank according to TM-score (over 110 decoy sets) in the absence of
native structures.
(b) The sum of TM-scores for Top1 models in the dawn region.
(c) The average and standard deviation of Pearson correlation coefficients
between predicted score and TM-score for every target in the dawn region.
(d) Targets whose best Z-score is less than the cutoff. On the SCOP1.75-500 set,
the number of targets is 110 (Z-score < 6.0), 63 (Z-score < 5.0) and 46 (Z-score < 4.0),
respectively.

doi:10.1371/journal.pone.0089935.t002

Materials and Methods 

Data Set 

The CASP7-8 data set, which consists
of 221 CASP7 and CASP8 targets (http://predictioncenter.org/), was used as training data.
The CASP10 data set, which includes 103
targets in total (http://predictioncenter.org/), was used as testing data.

For further training and testing, two more datasets,
SCOP1.75-Z6 as training data and SCOP1.75-500 as testing data,
were constructed independently from SCOP1.75 [27]. The
SCOP1.75-Z6 set was constructed as follows. First, 1401
domains covering 1195 fold classes were selected uniformly according to the
size of each fold class. Then, the 252 targets in the dawn region (optimal
Z-score < 6.0 for FR-t5) were kept. Similarly, the SCOP1.75-500 set,
which consists of 500 domains covering 307 folds, was built. Notably, a
major difference between the two data sets is that the SCOP1.75-Z6
set only includes targets in the dawn region, while the
SCOP1.75-500 set is a comprehensive set that consists of high-confidence
targets as well as targets in the dawn region. The
SCOP1.75-Z6 and SCOP1.75-500 data sets are available at
http://jianglab.ibp.ac.cn/lims/MEFTop/meftop.html.

For each protein in the training and testing data, 50 structural
models were generated by FR-t5.

Feature Extraction and SVM Predictor 

MEFTop was developed as an SVM predictor that considers
37 features classified into four groups: (1) 1-dimensional (1D) features,
(2) 2-dimensional (2D) contact map features, (3) Secondary
Structural Element (SSE) contact features and (4) 3-dimensional
topology features.

1D features included secondary structure (SS), represented by
helix, strand and coil states, and relative solvent accessibility (RSA),
represented by exposed and buried states. For a target sequence, the
SS state and RSA state of each residue were predicted by
SCRATCH [28]. For each structural model of the target sequence,
the SS state and RSA state were calculated for each residue with
DSSP [29]. Then the percentages of residues in the three SS states
(helix%, strand% and coil%) and in the two RSA states
(exposed%, buried%) were calculated over all the residues for
both the target sequence and its structural models. Thus we obtained
10 1D features in total (five for the sequence and five for the structural model). Based on
these 1D features, four similarity scores between the target
sequence and its structural model were derived by following Wang
and colleagues' work [21]. More specifically, the 1D features (the
percentages of helix, strand, coil, exposed and buried) of the target


Table 3. Improvements of FR-t5-M over FR-t5 for low-homology
targets on the SCOP1.75-500 set.

           Seq-40% (a)                  Seq-30% (a)
Metrics    Ave-Rank (b)   Top1% (c)     Ave-Rank (b)   Top1% (c)
Z-score    17.72          42.4          17.80          32.0
M-score    13.49          52.5          16.32          40.0

(a) The sequence identity.
(b) The average rank according to TM-score in the absence of native structures.
(c) The Top1% is the fraction of native-like Top1 models for all targets.
doi:10.1371/journal.pone.0089935.t003



sequence and its structural model can be regarded as two
composition vectors. The cosine, correlation, Gaussian kernel,
and dot product of the two composition vectors were calculated as
four similarity scores, giving 4 further features. In total, 14
features were derived as 1D features.
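As a rough illustration of the four similarity scores described above, the sketch below compares two composition vectors. The function names, the example percentages and the Gaussian kernel width `gamma` are assumptions made for illustration only; the kernel parameters actually used are not stated here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two composition vectors."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot_uv / norm if norm else 0.0


def correlation(u, v):
    """Pearson correlation coefficient between two composition vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def similarity_features(seq_comp, model_comp):
    """Return the four similarity scores as a feature list."""
    return [
        cosine(seq_comp, model_comp),
        correlation(seq_comp, model_comp),
        gaussian_kernel(seq_comp, model_comp),
        dot_product(seq_comp, model_comp),
    ]


# Hypothetical fractions: helix, strand, coil, exposed, buried.
predicted_from_sequence = [0.40, 0.20, 0.40, 0.55, 0.45]
computed_from_model = [0.35, 0.25, 0.40, 0.50, 0.50]
print(similarity_features(predicted_from_sequence, computed_from_model))
```

Together with the ten composition values themselves, these four scores would make up the 14 1D features.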

2D contact map features capture contact information between
residues with a sequence separation of at least 6 residues at two distance thresholds
(< 8 Å and < 12 Å) between the side chain centers of mass (SCM) [21].
For a target sequence, the contact probability of each residue pair
was predicted by SCRATCH, while the information about whether a residue
pair is in contact or not was readily extracted from structural models.
Then for each residue i in the target sequence or structural models, its
contact order and contact number were calculated as
$\sum_{|i-j| \ge 6} C_{ij}\,|i-j|$ and $\sum_{|i-j| \ge 6} C_{ij}$, respectively ($C_{ij}$ is the
predicted contact probability from the target sequence, or the
contact information extracted from structural models, for residues i and j).
Thus, the residue contact orders of the target sequence and its
structural model can be regarded as two composition vectors. The
cosine and correlation of the two composition vectors were
calculated as two similarity scores at each distance threshold.
Similarly, another 2 similarity scores were obtained for contact
number.
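The per-residue contact order and contact number can be sketched as follows, assuming the contact information is available as a square matrix (predicted probabilities for the sequence, or 0/1 values for a model). The function name `contact_profiles` and the toy matrix are hypothetical.

```python
def contact_profiles(contacts, min_separation=6):
    """Per-residue contact order and contact number.

    contacts[i][j] is a contact probability (sequence) or a 0/1 flag (model).
    Only pairs with |i - j| >= min_separation contribute.
    """
    n = len(contacts)
    order = [0.0] * n
    number = [0.0] * n
    for i in range(n):
        for j in range(n):
            separation = abs(i - j)
            if separation >= min_separation:
                order[i] += contacts[i][j] * separation
                number[i] += contacts[i][j]
    return order, number


# Toy 8-residue model: residues 0 and 7 are in contact (counted),
# residues 0 and 3 are in contact but too close in sequence (ignored).
size = 8
toy = [[0.0] * size for _ in range(size)]
toy[0][7] = toy[7][0] = 1.0
toy[0][3] = toy[3][0] = 1.0
toy_order, toy_number = contact_profiles(toy)
print(toy_order, toy_number)
```

The two resulting vectors (one from the sequence prediction, one from the model) would then be compared with the cosine and correlation functions at each distance threshold.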

In addition, the overall match score ($f_{res}$) of the contact
probability between the target sequence and the structural model was
calculated with the following equation:

$$f_{res} = \frac{\sum_{i,j=1}^{n} C_{ij} N_{ij}}{\sum_{i,j=1}^{n} N_{ij}} \tag{1}$$

Here, n is the length of the sequence. For residues i and j, $C_{ij}$ is the
predicted contact probability and $N_{ij}$ is the contact value from the
structural model (1 if in contact and 0 otherwise).



Table 4. Performances of FR-t5-M and FR-t5 on the CASP10 set.

           All                          Dawn region (a)              High-confidence (b)
Metrics    Ave-Rank (c)   Ave-TM (d)    Ave-Rank (c)   Ave-TM (d)    Ave-Rank (c)   Ave-TM (d)
Z-score    10.46          0.564         14.15          0.449         2.79           0.803
M-score    9.00           0.570         12.08          0.458         2.58           0.802

(a) 57 proteins whose optimal Z-score < 6.0.
(b) 46 proteins whose optimal Z-score >= 6.0.
(c) The average rank according to TM-score in the absence of native structures.
(d) The average of TM-scores for Top1 models.
doi:10.1371/journal.pone.0089935.t004
Table 5. Improvements of RaptorX-M over RaptorX in the
dawn region on the SCOP1.75-500 set.

Method       Top1% (a)   Sum (b)   CC ± σ (c)        Average (d)
RaptorX      76.0        69.95     0.572 ± 0.324     13.88
RaptorX-M    78.8        71.4      0.581 ± 0.320     11.81

(a) The Top1% is the fraction of native-like Top1 models for 104 targets in the
dawn region whose optimal Z-score (FR-t5) is less than 6.0 (6 targets for
which RaptorX could not produce complete models were removed).
(b) The sum of TM-scores for Top1 models in the dawn region.
(c) The average and standard deviation of Pearson correlation coefficients
between predicted score and TM-score for every target in the dawn region.
(d) The average rank according to TM-score (over 104 decoy sets, after removing the 6 targets
for which RaptorX could not produce complete models) in the absence of native
structures.

doi:10.1371/journal.pone.0089935.t005

Therefore, to describe the extent of correspondence between a
target sequence and its structural model, ten features were derived
at two distance thresholds: eight similarity scores for contact order and
contact number, and two overall match scores.
In total, 24 traditional 1D and 2D features were generated.

SSE contact features capture the spatial relationships of SSEs,
including the SSE pairs in contact, the distances
between SSEs and the SSE lengths. Based on the SS states of
residues calculated above, an SSE was identified as a segment
consisting of at least 4 consecutive residues in the helix or strand
state. Figure 4A shows a cartoon representation of two
contacts between two pairs of beta strands. For structural models,
the contact strength of two SSEs was computed as the number of
residue pairs in contact (distance threshold < 8.5 Å). For a target
sequence, the contact strength of two SSEs was computed as the
sum of their residue contact probabilities (threshold < 8 Å). An SSE
of a structural model was considered to correspond to an
SSE of the target sequence if the two SSEs had the minimum
difference in starting residue position along the
sequence. Only the SSEs that have a correspondence in both the
structural model and the target sequence were considered in the
following calculations. Then, the overall match score ($f_{SSE}$) of the
SSE contact strength between the structural model and the target
sequence was calculated with the following equation:

$$f_{SSE} = \frac{\sum_{i,j=1}^{n} C_{S(ij)} N_{S(ij)}}{\sum_{i,j=1}^{n} N_{S(ij)}} \tag{2}$$

Here, n is the total number of corresponding SSEs between a
structural model and the target sequence. $C_{S(ij)}$ is the predicted contact
strength of SSEs i and j from the target sequence divided by the length
of SSE i, and $N_{S(ij)}$ is the contact strength between SSEs i and j
extracted from the structural model divided by the length of SSE i.
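A minimal sketch of the two steps just described, SSE identification and the overall SSE match score of Equation 2, is given below. It assumes a three-state string (`H`, `E`, `C`) as input, treats each run of a single state as one segment, and takes the length-normalized contact strength matrices as already computed; the function names and toy values are hypothetical.

```python
from itertools import groupby


def find_sses(ss_states, min_length=4):
    """Return (start, end, state) for helix/strand runs of at least min_length.

    Assumption: a segment is a run of one state ('H' or 'E'); positions are
    0-based and inclusive.
    """
    segments = []
    position = 0
    for state, run in groupby(ss_states):
        length = len(list(run))
        if state in ("H", "E") and length >= min_length:
            segments.append((position, position + length - 1, state))
        position += length
    return segments


def sse_match_score(predicted_strength, model_strength):
    """Sum of C_S(ij) * N_S(ij) over all SSE pairs, divided by the sum of N_S(ij)."""
    n = len(model_strength)
    numerator = sum(
        predicted_strength[i][j] * model_strength[i][j]
        for i in range(n)
        for j in range(n)
    )
    denominator = sum(model_strength[i][j] for i in range(n) for j in range(n))
    return numerator / denominator if denominator else 0.0


# Toy example: one helix and one strand survive the length filter.
print(find_sses("CCHHHHHCCEEEECEEC"))

# Two corresponding SSEs with hypothetical normalized contact strengths.
c_s = [[0.0, 0.5], [0.5, 0.0]]
n_s = [[0.0, 2.0], [2.0, 0.0]]
print(sse_match_score(c_s, n_s))
```

The pairing of model SSEs with sequence SSEs by closest starting position is left out here; the matrices are assumed to be indexed over corresponding SSEs only.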

Two composition vectors of SSE contact numbers were
generated, one for the structural model,
$P_M = [P_{M(1)}, P_{M(2)}, \ldots, P_{M(n)}]$ (where $P_{M(i)} = \sum_{j=1}^{n} N_{S(ij)}$ is the sum of contact strength
for SSE i in the structural model), and one for the target sequence,
$P_T = [P_{T(1)}, P_{T(2)}, \ldots, P_{T(n)}]$ (where $P_{T(i)} = \sum_{j=1}^{n} C_{S(ij)}$ is the sum of contact
strength for SSE i in the target sequence). They were then transformed
into similarity scores using the cosine and correlation functions.

The overall match of the distances of SSE pairs between the 
structural model and target sequence was also considered. First, 



Table 6. Improvements of SPARKS-X-M over SPARKS-X in the
dawn region on the SCOP1.75-500 set.

Method        Top1% (a)   Sum (b)   CC ± σ (c)        Average (d)
SPARKS-X      70.9        68.64     0.518 ± 0.351     13.05
SPARKS-X-M    73.7        69.93     0.587 ± 0.301     9.77

(a) The Top1% is the fraction of native-like Top1 models for 110 targets in the
dawn region whose optimal Z-score (FR-t5) is less than 6.0.
(b) The sum of TM-scores for Top1 models in the dawn region.
(c) The average and standard deviation of Pearson correlation coefficients
between predicted score and TM-score for every target in the dawn region.
(d) The average rank according to TM-score (over 110 decoy sets) in the absence of
native structures.

doi:10.1371/journal.pone.0089935.t006

the distance of an SSE pair in the structural model was taken as
the minimum distance between residues of this SSE pair, and the
distance of an SSE pair for the target sequence was estimated from its
predicted residue contact probability as follows:

$$D = \begin{cases} D_m, & p \in [0.0, 0.1] \\ -k \log(p - p_0) + D_0, & p \in (0.1, 1.0] \end{cases} \tag{3}$$

Here, D is the predicted distance of an SSE pair, p is the
maximum predicted contact probability between the residues of the
SSE pair, $D_m$ is the distance threshold, k is a constant, and $p_0$ and
$D_0$ denote ideal status values. Then, the similarity score of SSE
pair distance between the structural model and the target sequence
was calculated following equation S3 (see Methods S1).
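Reading Equation 3 as a piecewise rule (a constant distance $D_m$ when the maximum contact probability p is at most 0.1, and a logarithmic decay otherwise), it could be sketched as below. All default parameter values are placeholders rather than the authors' values, and the natural logarithm is an assumption, since the base is not stated.

```python
import math


def predicted_sse_distance(p, d_m=8.0, k=1.0, p0=0.0, d0=4.0):
    """Estimate the distance of an SSE pair from its maximum contact probability.

    d_m, k, p0 and d0 are hypothetical placeholder values; p0 must stay
    below p on the logarithmic branch so that log(p - p0) is defined.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p <= 0.1:
        # Low-confidence pairs are assigned the distance threshold itself.
        return d_m
    return -k * math.log(p - p0) + d0


for probability in (0.05, 0.2, 0.5, 0.9):
    print(probability, predicted_sse_distance(probability))
```

With these placeholders, a higher contact probability maps to a shorter predicted distance, which is the qualitative behaviour the equation describes.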

The lengths of the corresponding SSEs in the structural
model and the target sequence were compared and transformed into
two different ratios by equations S6 and S7 (see Methods S1). In
summary, 6 SSE contact features were generated:
one overall match score ($f_{SSE}$) of the SSE contact strength, two
similarity scores for SSE contact numbers, one similarity score of
SSE pair distance, and two different ratios of SSE lengths.

As shown in Figure 4B and 4C, the topology features were
generated from the radius of gyration, the Hydrophobic Core (HC) and the
local conformation potential of all fragments of a structural model.
To capture the topological compactness, the radius of gyration of
each structural model was calculated (Figure 4B). In addition,
the radius of gyration can be predicted from the
length of the target sequence according to the following equation

Table 7. Improvements of RaptorX-M and SPARKS-X-M for
low-homology targets on the SCOP1.75-500 set.

              Seq-40% (a)                  Seq-30% (a)
Method        Top1% (b)   Ave-Rank (c)     Top1% (b)   Ave-Rank (c)
RaptorX       63.2        20.53            54.2        25.38
RaptorX-M     66.7        17.07            62.5        19.38
SPARKS-X      47.5        15.31            32.0        20.64
SPARKS-X-M    49.2        10.85            40.0        11.52

(a) The sequence identity.
(b) The Top1% is the fraction of native-like Top1 models for all targets.
(c) The average rank according to TM-score in the absence of native structures.
Note: Ave-Rank is only compared between a pair of methods (RaptorX/RaptorX-M
and SPARKS-X/SPARKS-X-M).

doi:10.1371/journal.pone.0089935.t007
Figure 4. The overview of MEFTop. (A) The cartoon representation of two contacts between two pairs of SSEs (beta strands). (B) The radius of
gyration for the model structure as one of the topology features. (C) Hydrophobic core and local conformation potential based on residue fragments.
Schematic representation of the backbone atoms (N, CA, C, O) and the side chain center of mass is shown. (D) The SVM predictor. Four groups of input
features: traditional sequence (1D) and contact map (2D) features and two groups of newly introduced structural features including SSE contact
features and topology features.
doi:10.1371/journal.pone.0089935.g004



[30]:

$$R = k \times L^{m} \tag{4}$$

Here, R is the predicted radius of gyration, L is the length of the
sequence, and k and m are constant parameters. The radii of gyration
predicted from the target sequence and extracted from the
structural model were compared and transformed into two
similarity scores by equations S8 and S9 (see Methods S1).
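To make the two quantities being compared concrete, the sketch below computes an unweighted radius of gyration from model coordinates and the length-based prediction of Equation 4. The constants `k` and `m` shown are illustrative placeholders, not the values used by the authors, and treating all points as equally weighted is an assumption.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    center = [sum(point[d] for point in coords) / n for d in range(3)]
    squared = sum(
        sum((point[d] - center[d]) ** 2 for d in range(3)) for point in coords
    )
    return math.sqrt(squared / n)


def predicted_radius(length, k=2.2, m=0.38):
    """R = k * L**m; k and m here are placeholder constants."""
    return k * length ** m


# Toy "model" with four points on a square of side 2 in the xy-plane.
toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0), (0.0, 2.0, 0.0)]
print(radius_of_gyration(toy_coords))
print(predicted_radius(120))
```

The transformation of these two radii into similarity scores (equations S8 and S9) is not reproduced here, since those equations are given only in the supporting information.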

Besides radius of gyration constraints, some local interactions,
such as hydrophobic interactions, play important roles in protein
folding and topological stability. Thus, specific local hydrophobic
residue clusters were defined as the Hydrophobic Core (HC), which
serves as a new structural descriptor (Figure 4C). The radius, the
number of hydrophobic residues and the number of SSEs in the HC
were compared to those in the structural model, and transformed into
three 3D topology features. In addition, potentials from the local
conformation of fragments (Figure 4C) [31] were also used as 2
features. In total, 7 features were obtained for describing the
3-dimensional topology.

Figure 4D illustrates the use of the SVM predictor as the core
component of MEFTop to evaluate model quality. The SVM
predictor takes as inputs the traditional 1D and 2D residue contact
map features and the two groups of additional structural features.
Thus MEFTop represents a novel model evaluation and selection
program with a focus on predicting the similarity in topology
between a predicted model and its native structure.



Evaluation Score 

In the FR-t5 program, the Z-score was applied for template
ranking, and could also be used to assist in the selection of the
optimal structural model. The raw score $N_{score}$ of the FR-t5 scoring
function [7], which is positively correlated with the quality of the
alignment between query and template sequences, was transformed
into a Z-score as follows:

$$Z\text{-score} = \frac{N_{score} - \overline{N_{score}}}{\sqrt{\overline{N_{score}^{2}} - \overline{N_{score}}^{2}}} \tag{5}$$

Here $\overline{N_{score}}$ is the average of $N_{score}$, and $\overline{N_{score}^{2}}$ is the mean
square of $N_{score}$.
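Equation 5 is the usual standardization of a raw score by its mean and standard deviation, with the variance written as the mean square minus the squared mean. A minimal sketch over a list of raw scores follows; the function name and the toy scores are hypothetical.

```python
import math


def z_scores(raw_scores):
    """Convert raw scores to Z-scores using mean and mean-square, as in Eq. 5."""
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    mean_square = sum(score * score for score in raw_scores) / n
    variance = mean_square - mean * mean
    if variance <= 0.0:
        # All scores identical (or rounding noise): no template stands out.
        return [0.0] * n
    spread = math.sqrt(variance)
    return [(score - mean) / spread for score in raw_scores]


toy_raw = [1.0, 2.0, 3.0]
print(z_scores(toy_raw))
```

In this scheme, a target would fall in the dawn region when the largest value returned for its template library is below 6.0.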

In MEFTop, a P-score was used to evaluate the quality of a
structural model through an SVM regression function f(x) as
follows:

$$P\text{-score} = f(x) = \sum_{x_i \in S} (\alpha_i - \alpha_i^{*}) K(x, x_i) + b \tag{6}$$

Here, the value computed by f(x) is the estimate of the TM-score
associated with an input feature vector x of a model, $\alpha_i$ and
$\alpha_i^{*}$ are non-negative weights assigned to the training data point $x_i$,
and they control the trade-off between training errors and the
smoothness of f(x) during training [32]. b represents the bias term.
K is the kernel function, which can be viewed as a function that
computes the similarity between the training data point $x_i$ and the
target data point x. The function-related parameters were
optimized on the training set.
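The regression function above can be evaluated directly once the support vectors, their coefficients and the bias are known. The sketch below does this in plain Python with a Gaussian (RBF) kernel; the kernel choice, the `gamma` value and all numbers are assumptions for illustration, and in practice these quantities would come from a trained LIBSVM model.

```python
import math


def rbf_kernel(x, x_i, gamma=0.5):
    """Gaussian kernel; the kernel type and gamma are hypothetical choices."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_i)))


def p_score(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x, x_i) + b.

    coefficients[i] holds the difference alpha_i - alpha_i* for support
    vector i.
    """
    total = sum(
        coefficient * kernel(x, vector)
        for vector, coefficient in zip(support_vectors, coefficients)
    )
    return total + bias


# Toy model with two support vectors in a 2-feature space.
toy_vectors = [[0.0, 0.0], [1.0, 1.0]]
toy_coefficients = [0.3, -0.1]
toy_bias = 0.2
print(p_score([0.2, 0.1], toy_vectors, toy_coefficients, toy_bias))
```

A real MEFTop feature vector would have 37 entries rather than two.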

In order to form the new modeling program FR-t5-M,
MEFTop was combined with FR-t5. A new metric called M-score
was then defined as follows:

$$M\text{-score} = Z\text{-score} + n \times P\text{-score} \tag{7}$$

Here, n is the weight for P-score.
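A small sketch of how such a combined score could be used to re-rank candidate models is shown below. The weight value and the two toy models are hypothetical; they are chosen only to show a case where the P-score changes which model is ranked first.

```python
def m_score(z_score, p_score_value, weight):
    """M-score = Z-score + weight * P-score."""
    return z_score + weight * p_score_value


def rank_models(models, weight):
    """Sort candidate models from best to worst M-score."""
    return sorted(
        models,
        key=lambda model: m_score(model["z"], model["p"], weight),
        reverse=True,
    )


# Hypothetical candidates: A has the higher Z-score, B the higher P-score.
candidates = [
    {"name": "A", "z": 5.0, "p": 0.3},
    {"name": "B", "z": 4.5, "p": 0.7},
]
print([model["name"] for model in rank_models(candidates, weight=2.0)])
print([model["name"] for model in rank_models(candidates, weight=0.0)])
```

With a weight of zero the ranking reduces to the Z-score ranking of FR-t5, so the weight controls how strongly the model evaluation score can override the threading score.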

Training and Testing 

MEFTop was first trained and evaluated as a general model
evaluation method. The training dataset was the
CASP7-8 set, which was generated from the 221 targets of CASP7 and
CASP8 with FR-t5. Furthermore, in order to adapt this method
for targets in the dawn region, MEFTop was optimized using the
SCOP1.75-Z6 set. First, a weight was assigned to the feature vector of each structural
model according to its TM-score, and the weights
and features were used as inputs for the software LIBSVM [33].
Basically, a larger TM-score corresponds to a larger weight. Subsequently,
the SVM predictor was trained and optimized with a cost
function (F) as follows:

$$F = N_n + n \times Z + N_m \tag{8}$$

Here, $N_n$ is the average rank of the native structure, Z is the
average Z-score in SVM (Z-score_SVM) for the training targets, n is the
weight of Z-score_SVM, and $N_m$ is the number of missed proteins whose
native structures were not ranked 1st. The optimization goal
was to minimize the cost function value. To evaluate the
robustness of the SVM predictor, a 5-fold test on the dataset
SCOP1.75-Z6 was carried out.

After training MEFTop using the above process, its
performance was tested on the SCOP1.75-500 set, with
particular focus on targets in the dawn region. Two main criteria
were used to evaluate the performance of the evaluation method:
the percentage of native-like Top1 models (Top1%)
and the average rank of Top1 models. The Top1% is the fraction
of native-like Top1 models for all targets. If the TM-score of a